Improved benchmarks for computational motif discovery
Background
An important step in annotation of sequenced genomes is the identification of transcription factor binding sites. More than a hundred different computational methods have been proposed, and it is difficult to make an informed choice. Therefore, robust assessment of motif discovery methods becomes important, both for validation of existing tools and for identification of promising directions for future research.
Results
We use a machine learning perspective to analyze collections of transcription factors with known binding sites. Algorithms are presented for finding position weight matrices (PWMs), IUPAC-type motifs and mismatch motifs with optimal discrimination of binding sites from remaining sequence. We show that for many data sets in a recently proposed benchmark suite for motif discovery, none of the common motif models can accurately discriminate the binding sites from remaining sequence. This may obscure the distinction between the potential performance of the motif discovery tool itself versus the intrinsic complexity of the problem we are trying to solve. Synthetic data sets may avoid this problem, but we show on some previously proposed benchmarks that there may be a strong bias towards a presupposed motif model. We also propose a new approach to benchmark data set construction. This approach is based on collections of binding site fragments that are ranked according to the optimal level of discrimination achieved with our algorithms. This allows us to select subsets with specific properties. We present one benchmark suite with data sets that allow good discrimination between positive and negative instances with the common motif models. These data sets are suitable for evaluating algorithms for motif discovery that rely on these models. We present another benchmark suite where PWM, IUPAC and mismatch motif models are not able to discriminate reliably between positive and negative instances. This suite could be used for evaluating more powerful motif models.
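To make the idea of discriminating binding sites from remaining sequence concrete, the following is a minimal sketch of scoring sequences with a log-odds PWM. It is not the authors' optimization algorithm: the toy sites, the uniform background of 0.25, the pseudocount, and the function names `build_pwm` and `best_window_score` are all assumptions introduced for illustration.

```python
import math

def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a log-odds PWM from equal-length example sites (assumed toy input)."""
    length = len(sites[0])
    pwm = []
    for pos in range(length):
        # Start every base at the pseudocount so unseen bases get a finite score.
        counts = {base: pseudocount for base in "ACGT"}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        # Log-odds of the observed frequency against a uniform background.
        pwm.append({b: math.log2((c / total) / background) for b, c in counts.items()})
    return pwm

def best_window_score(pwm, sequence):
    """Return the highest PWM score over all windows of the sequence."""
    width = len(pwm)
    return max(
        sum(pwm[i][sequence[start + i]] for i in range(width))
        for start in range(len(sequence) - width + 1)
    )

# Hypothetical toy data, not taken from the benchmark suites.
sites = ["TATAAT", "TATAAT", "TACAAT", "TATGAT"]
pwm = build_pwm(sites)
positive = best_window_score(pwm, "GGCTATAATGC")  # contains a site-like window
negative = best_window_score(pwm, "GGCCGGCCGGC")  # contains no site-like window
print(positive > negative)
```

A threshold placed between the two scores would separate these toy sequences. On the harder data sets described above, no such threshold would separate positives from negatives reliably for any of the common motif models.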
Conclusion
Our improved benchmark suites have been designed to differentiate between the performance of motif discovery algorithms and the power of motif models. We provide a web server where users can download our benchmark suites, submit predictions and visualize scores on the benchmarks.
Age-Associated Hyper-Methylated Regions in the Human Brain Overlap with Bivalent Chromatin Domains
PMCID: PMC3454416
This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Identifying elemental genomic track types and representing them uniformly
Background
With the recent advances and availability of various high-throughput sequencing technologies, data on many molecular aspects, such as gene regulation, chromatin dynamics, and the three-dimensional organization of DNA, are rapidly being generated in an increasing number of laboratories. The variation in biological context, and the increasingly dispersed mode of data generation, imply a need for precise, interoperable and flexible representations of genomic features through formats that are easy to parse. A host of alternative formats are currently available and in use, complicating analysis and tool development. The issue of whether and how the multitude of formats reflects varying underlying characteristics of data has to our knowledge not previously been systematically treated.
Results
We here identify intrinsic distinctions between genomic features, and argue that the distinctions imply that a certain variation in the representation of features as genomic tracks is warranted. Four core informational properties of tracks are discussed: gaps, lengths, values and interconnections. From this we delineate fifteen generic track types. Based on the track type distinctions, we characterize major existing representational formats and find that the track types are not adequately supported by any single format. We also find, in contrast to the XML formats, that none of the existing tabular formats are conveniently extendable to support all track types. We thus propose two unified formats for track data, an improved XML format, BioXSD 1.1, and a new tabular format, GTrack 1.0.
Conclusions
The defined track types are shown to capture relevant distinctions between genomic annotation tracks, resulting in varying representational needs and analysis possibilities. The proposed formats, GTrack 1.0 and BioXSD 1.1, cater to the identified track distinctions and emphasize preciseness, flexibility and parsing convenience.
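One way to see how four informational properties can yield fifteen track types is to count the non-empty combinations of those properties. The sketch below does only that arithmetic. It assumes each property is simply present or absent and that the combination with no properties is excluded. The abstract does not spell out this derivation, and the function name `property_combinations` is hypothetical.

```python
from itertools import product

# The four core informational properties named in the abstract.
PROPERTIES = ("gaps", "lengths", "values", "interconnections")

def property_combinations():
    """List every non-empty present/absent combination of the four properties."""
    combos = []
    for flags in product((False, True), repeat=len(PROPERTIES)):
        present = tuple(name for name, flag in zip(PROPERTIES, flags) if flag)
        if present:  # assumption: a track with none of the properties is excluded
            combos.append(present)
    return combos

combos = property_combinations()
print(len(combos))  # 2**4 - 1 combinations
```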
Segmentation of DNA sequences into twostate regions and melting fork regions
The accurate prediction and characterization of DNA melting domains by
computational tools could facilitate a broad range of biological applications.
However, no algorithm for melting domain prediction has been available until
now. The main challenges include the difficulty of mathematically mapping a
qualitative description of DNA melting domains to quantitative statistical
mechanics models, as well as the absence of 'gold standards' and a need for
generality. In this paper, we introduce a new approach to identify the twostate
regions and melting fork regions along a given DNA sequence. Compared with an
ad hoc segmentation used in one of our previous studies, the new algorithm is
based on boundary probability profiles, rather than standard melting maps. We
demonstrate that a more detailed characterization of the DNA melting domain map
can be obtained using our new method, and this approach is independent of the
choice of DNA melting model. We expect this work to drive our understanding of
DNA melting domains one step further.
Comment: 17 pages, 8 figures; new introduction, added refs, minor changes
Improving generalization of machine learning-identified biomarkers with causal modeling: an investigation into immune receptor diagnostics
Machine learning is increasingly used to discover diagnostic and prognostic
biomarkers from high-dimensional molecular data. However, a variety of factors
related to experimental design may affect the ability to learn generalizable
and clinically applicable diagnostics. Here, we argue that a causal perspective
improves the identification of these challenges, and formalizes their relation
to the robustness and generalization of machine learning-based diagnostics. To
make for a concrete discussion, we focus on a specific, recently established
high-dimensional biomarker - adaptive immune receptor repertoires (AIRRs). We
discuss how the main biological and experimental factors of the AIRR domain may
influence the learned biomarkers and provide easily adjustable simulations of
such effects. In conclusion, we find that causal modeling improves machine
learning-based biomarker robustness by identifying stable relations between
variables and by guiding the adjustment of the relations and variables that
vary between populations.
Linguistically inspired roadmap for building biologically reliable protein language models
Deep neural-network-based language models (LMs) are increasingly applied to
large-scale protein sequence data to predict protein function. However, being
largely black-box models and thus challenging to interpret, current protein LM
approaches do not contribute to a fundamental understanding of
sequence-function mappings, hindering rule-based biotherapeutic drug
development. We argue that guidance drawn from linguistics, a field specialized
in analytical rule extraction from natural language data, can aid with building
more interpretable protein LMs that are more likely to learn relevant
domain-specific rules. Differences between protein sequence data and linguistic
sequence data require the integration of more domain-specific knowledge in
protein LMs compared to natural language LMs. Here, we provide a
linguistics-based roadmap for protein LM pipeline choices with regard to
training data, tokenization, token embedding, sequence embedding, and model
interpretation. Incorporating linguistic ideas into protein LMs enables the
development of next-generation interpretable machine-learning models with the
potential of uncovering the biological mechanisms underlying sequence-function
relationships.
Comment: 27 pages, 4 figures
ImmunoLingo: Linguistics-based formalization of the antibody language
Apparent parallels between natural language and biological sequence have led
to a recent surge in the application of deep language models (LMs) to the
analysis of antibody and other biological sequences. However, a lack of a
rigorous linguistic formalization of biological sequence languages, which would
define basic components, such as lexicon (i.e., the discrete units of the
language) and grammar (i.e., the rules that link sequence well-formedness,
structure, and meaning) has led to largely domain-unspecific applications of
LMs, which do not take into account the underlying structure of the biological
sequences studied. A linguistic formalization, on the other hand, establishes
linguistically-informed and thus domain-adapted components for LM applications.
It would facilitate a better understanding of how differences and similarities
between natural language and biological sequences influence the quality of LMs,
which is crucial for the design of interpretable models with extractable
sequence-function relationship rules, such as the ones underlying the antibody
specificity prediction problem. Deciphering the rules of antibody specificity
is crucial to accelerating rational and in silico biotherapeutic drug design.
Here, we formalize the properties of the antibody language and thereby
establish not only a foundation for the application of linguistic tools in
adaptive immune receptor analysis but also for the systematic immunolinguistic
studies of immune receptor specificity in general.
Comment: 19 pages, 3 figures
Artificial intelligence-driven prediction of COVID-19-related hospitalization and death: a systematic review
Aim
To perform a systematic review on the use of Artificial Intelligence (AI) techniques for predicting COVID-19 hospitalization and mortality using primary and secondary data sources.
Study eligibility criteria
Cohort studies, clinical trials, meta-analyses, and observational studies investigating COVID-19 hospitalization or mortality using artificial intelligence techniques were eligible. Articles without a full text available in the English language were excluded.
Data sources
Articles recorded in Ovid MEDLINE from 01/01/2019 to 22/08/2022 were screened.
Data extraction
We extracted information on data sources, AI models, and epidemiological aspects of retrieved studies.
Bias assessment
A bias assessment of AI models was done using PROBAST.
Participants
Patients who tested positive for COVID-19.
Results
We included 39 studies related to AI-based prediction of hospitalization and death related to COVID-19. The articles were published in the period 2019-2022, and mostly used Random Forest as the model with the best performance. AI models were trained using cohorts of individuals sampled from populations of European and non-European countries, mostly with cohort sample size <5,000. Data collection generally included information on demographics, clinical records, laboratory results, and pharmacological treatments (i.e., high-dimensional datasets). In most studies, the models were internally validated with cross-validation, but the majority of studies lacked external validation and calibration. Covariates were not prioritized using ensemble approaches in most of the studies; however, models still showed moderately good performance, with Area under the Receiver operating characteristic Curve (AUC) values >0.7. According to the assessment with PROBAST, all models had a high risk of bias and/or concern regarding applicability.
Conclusions
A broad range of AI techniques have been used to predict COVID-19 hospitalization and mortality. The studies reported good prediction performance of AI models; however, a high risk of bias and/or concern regarding applicability was detected.
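As a reminder of what the AUC values reported above measure, the sketch below computes AUC as the fraction of positive-negative pairs in which the positive case receives the higher predicted score, with ties counted as one half. The labels and scores are invented toy values, not data from any reviewed study, and the function name `auc` is an assumption.

```python
def auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Hypothetical outcomes (1 = hospitalized) and model risk scores.
toy_labels = [1, 1, 1, 0, 0, 0]
toy_scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.2]
toy_auc = auc(toy_labels, toy_scores)
print(toy_auc > 0.7)
```

An AUC computed this way says nothing about calibration or external validity, which is why the review treats those as separate concerns.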